Abstract. We present an algorithm for phylogenetic reconstruction using quartets that returns the 
correct topology for n taxa in O(n log n) time with high probability, in a probabilistic model where a 
quartet is not consistent with the true topology of the tree with constant probability, independent of 
other quartets. Our incremental algorithm relies upon a search tree structure for the phylogeny that is 
balanced, with high probability, no matter what the true topology is. Our experimental results show that 
our method is comparable in runtime to the fastest heuristics, while still offering consistency guarantees. 



1 Introduction 

Incremental phylogenetic reconstruction algorithms add new taxa to a topology until all n taxa have been 
added. They optimize a greedy objective at all n insertions, much as agglomerative algorithms (like neighbour 
joining or UPGMA) optimize an objective at all n - 1 agglomerations. Such algorithms can be quite efficient. 
If each addition requires O(f(n)) time, the overall runtime is O(n f(n)). 

We give an algorithm where each insertion requires O(log n) runtime with high probability, and where the 
probability that any insertion is incorrect is o(1) in a simple error model. Thus, our randomized algorithm has 
runtime O(n log n) with high probability (regardless of the true topology) and o(1) probability of producing 
an incorrect topology. We believe it is the first O(n polylog n)-runtime algorithm with such guarantees. Any 
o(n log n)-runtime algorithm cannot return all topologies, so our algorithm is asymptotically optimal. 

We present a review of related work, give basic definitions, and then give the algorithm in the case 
of error-free data. Then, we extend the algorithm to the case of data containing noise. Finally, we give 
some experimental results on real and simulated data. Our error-tolerant algorithms offer the possibility of 
producing a phylogenetic tree in runtime smaller than that of producing even the input matrix to a distance 
method like neighbour joining, while still having high probability of reconstructing the true tree. 

2 Related work 

Phylogenetic quartet methods reconstruct trees from sets of four taxa and combine these phylogenies into 
the overall tree. Quartet puzzling [18] is one of the first algorithms in this line of research. Many heuristic 
algorithms also operate on this principle (e.g. [15,16]). 

Some quartet algorithms find the correct phylogeny with high probability under a certain model of evolution. 
Erdős et al. [7] give an $O(n^4 \log n)$ algorithm that reconstructs the phylogeny with probability 1 - o(1), 
assuming that the sequences evolve according to the Cavender-Farris model of evolution, for sufficiently long 
sequences. The runtime of their algorithm is $O(n^2)$ for most trees. Csűrös [4] provided a practical $O(n^2)$ 
algorithm with similar performance guarantees. Recent papers [10,5] give similar algorithms to identify parts 
of the tree that can be reconstructed. These approaches choose queries so that, in the assumed model of 
evolution, all queries are correct with high probability. 

The only sub-quadratic time algorithm with guarantees on reconstruction accuracy is by King et al. [13]. 
The running time is $O(n^2 \log\log n / \log n)$, provided that the sequences are long enough. 

Wu et al. [19] gave a simple error model where each quartet query independently errs with fixed probability 
p. They gave an $O(n^4 \log n)$ algorithm that errs with constant probability under this model. This model has 
also been used for evaluating algorithms for maximum quartet consistency [20]. 

We improve on Wu et al. in runtime and accuracy with an O(n log n) algorithm that errs with probability 
o(1). To our knowledge, it is the first provably error-tolerant, substantially sub-quadratic time algorithm for 
phylogenetic reconstruction. (Recently, an $O(n^{1.5})$ heuristic algorithm has been proposed [14].) 

Fast algorithms have been proposed for error-free data. Kannan et al. [11] use error-free rooted triples in 
an O(n log n) algorithm. Rooted triples reduce to quartets if we pick one taxon as an outgroup and always 
ask quartet queries for sets with that taxon, so that algorithm works for error-free quartets. 

Our algorithm uses ideas from work on noisy binary search in which comparisons have fixed error 
probability, by Feige et al. [8] and Karp and Kleinberg [12]. 

3 Definitions 

We begin with definitions about the two trees we will focus on: the phylogeny we are reconstructing and the 
search tree that allows us to do the insertions. 

A phylogeny T is an unrooted binary tree with n leaves in 1-to-1 correspondence with a set S of taxa. 
Removing internal node v, and its incident edges, from a phylogeny yields three subtrees, $t_i(T, v)$ for i = 1, 2, 3. 
The tree $t_i(T, v)$ joined with its edge to v is the child subtree $c_i(T, v)$. Phylogeny T' is consistent with T if 
its taxa are a subset of those of T, and T' is formed by the union of all paths in T between taxa in T', with 
internal nodes of degree 2 removed. A border node of subtree T' in T is any internal node of T that is a leaf 
in T'. 

A quartet is a phylogeny of four taxa. A quartet query q(a, b, c, d) returns one of three possible quartet 
topologies: ab|cd, ac|bd and ad|bc, where in ab|cd, if we remove the internal edge, we disconnect {a, b} from 
{c, d}. We assume a quartet query can be done in O(1) time. In Section 5, our error model considers how 
often quartet queries for four taxa of T are inconsistent with T. A node query N(T, v, x) for internal node v 
of phylogeny T and new taxon x is a quartet query $q(x, a_1, a_2, a_3)$, where $a_i$ is a leaf of T in $t_i(T, v)$. Such a 
query identifies the $c_i(T, v)$ where taxon x belongs, if it is consistent with the true topology. 
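
The reduction from a node query to a quartet query can be sketched in a few lines of Python. This is only an illustration: the encoding of a topology as a pair of pairs and the names make_quartet and node_query are assumptions made here, not part of the paper.

```python
from typing import Callable, FrozenSet, Hashable

Quartet = FrozenSet[FrozenSet[Hashable]]


def make_quartet(a, b, c, d) -> Quartet:
    """Encode the topology ab|cd as an unordered pair of unordered pairs."""
    return frozenset({frozenset({a, b}), frozenset({c, d})})


def node_query(q: Callable[..., Quartet], x, a1, a2, a3) -> int:
    """Ask q(x, a1, a2, a3) and return the index (0, 1 or 2) of the
    representative that is paired with x, i.e. the child subtree for x."""
    topology = q(x, a1, a2, a3)
    for i, a in enumerate((a1, a2, a3)):
        if frozenset({x, a}) in topology:
            return i
    raise ValueError("answer is not a topology on {x, a1, a2, a3}")
```

Because ab|cd and cd|ab are the same topology, the unordered encoding lets two answers be compared with plain equality.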

3.1 Search tree 

A natural algorithm to add taxon x to phylogeny T begins at an internal node v and uses node query N(T, v, x) 
to identify the $t_i(T, v)$ where taxon x belongs. We move to the neighbour of v in that subtree, and repeat the 
process until the subtree into which x is to be placed is only one edge e, which we break into two edges and 
hang x onto; see Figure 1. We follow the path from v to an endpoint of e and identify the other endpoint 
with one more query. The number of node queries equals this path length plus one. For a balanced tree with 
diameter $\Theta(\log n)$, this gives a $\Theta(n \log n)$ incremental phylogeny algorithm. But for trees like a caterpillar 
tree, with $\Theta(n)$ diameter, this algorithm requires $\Theta(n^2)$ queries. 



Fig. 1. Natural incremental algorithm: start at the root and search to find the place for the new taxon by asking 
queries down the path. Break an edge to insert the new taxon. 

We give a search tree structure to manage the expected number of queries on the search path, regardless 
of the underlying tree topology. 

Definition 1. A search tree Y(T) for a phylogeny T is a rooted ternary tree satisfying the following 
conditions: 

1. Each node y in Y(T) is associated with a distinct subtree r(y) of T. 

2. The root of Y(T) is associated with the full tree T. 

3. For each internal node y in Y(T), there exists an internal node s(y) in T such that the three subtrees 
associated with the children of y are the intersections between r(y) and the three child subtrees of the node 
s(y) in T. There are also three nonempty lists $\ell_i(y)$ stored at each internal node y; each element of $\ell_i(y)$ 
is a taxon in $t_i(T, s(y))$. 

4. For each node y in Y(T), r(y) has at most two border nodes in T. 

Y(T) is complete if each leaf in Y(T) is associated with a single edge of T, and each edge of T has a 
corresponding leaf in Y(T). For a given node y in the search tree, its associated node s(y) in T may be picked 
so the three child subtrees are reasonably balanced; this gives expected O(log n) insertion time. See Figure 2 
for an example. 
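
One possible in-memory layout for a search tree node is sketched below in Python. The class name, field names and the helper initial_search_tree are illustrative assumptions; the check covers only the conditions of Definition 1 that can be verified locally at one node.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SearchNode:
    """One node y of a search tree Y(T); field names are hypothetical."""
    split_node: Optional[str] = None                            # s(y), None at a leaf
    taxa_lists: List[List[str]] = field(default_factory=list)   # the three lists l_i(y)
    children: List["SearchNode"] = field(default_factory=list)
    border_nodes: List[str] = field(default_factory=list)       # border nodes of r(y)
    edge: Optional[Tuple[str, str]] = None                      # edge of T, leaves only

    def is_leaf(self) -> bool:
        return not self.children

    def satisfies_definition(self) -> bool:
        """Local checks: at most two border nodes; an internal node has s(y),
        three children and three nonempty lists; a leaf stores its edge."""
        if len(self.border_nodes) > 2:
            return False
        if self.is_leaf():
            return self.edge is not None
        return (
            self.split_node is not None
            and len(self.children) == 3
            and len(self.taxa_lists) == 3
            and all(self.taxa_lists)
        )


def initial_search_tree(p1: str, p2: str, p3: str, centre: str = "u") -> SearchNode:
    """Search tree for the three-taxon star: a root plus one leaf per edge."""
    leaves = [SearchNode(border_nodes=[centre], edge=(centre, t)) for t in (p1, p2, p3)]
    return SearchNode(split_node=centre, taxa_lists=[[p1], [p2], [p3]], children=leaves)
```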

4 An algorithm for error-free data 

Using our search tree structure gives a straightforward incremental phylogeny algorithm if quartets are all 
consistent with T, the true topology. 

We pick a random permutation $\pi$ of the taxa, and start with the unique topology $T_3$ for $\{\pi_1, \pi_2, \pi_3\}$, and 
a search tree $Y(T_3)$ with four nodes: a root w with $r(w) = T_3$ and s(w) the internal node of $T_3$, and with one 
leaf for each edge of $T_3$. We also store $\ell_i(w) = \{\pi_i\}$; we also use $\ell_i(w)$ to represent the unique member of this 
set. This fits our requirements for a complete search tree of $T_3$. 

Now, assuming $T_i$ is consistent with T, and $Y(T_i)$ is a valid search tree for $T_i$, we add $\pi_{i+1}$ to produce 
$T_{i+1}$ and $Y(T_{i+1})$. We start at the root w of $Y(T_i)$ and ask the node query $N(T_i, s(w), \pi_{i+1})$ using the quartet 
$q(\pi_{i+1}, \ell_1(w), \ell_2(w), \ell_3(w))$; this tells us which child of w we should move to next. We continue until we reach 
a leaf y of $Y(T_i)$; this corresponds to the edge e of $T_i$ where the new taxon $\pi_{i+1}$ belongs. We break edge e 
into two parts, creating a new node u and a new edge from u to the new leaf $\pi_{i+1}$. The new tree is $T_{i+1}$. 

To update $Y(T_i)$, we create three edges from y to a new node for each of the three newly created edges 
and let $\ell_1(y)$ be $\{\pi_{i+1}\}$, and set $\ell_2(y)$ and $\ell_3(y)$ to contain the taxon closest to $\pi_{i+1}$ in the final quartet query 
and one of the two taxa that was not closest to $\pi_{i+1}$ in that query. Since node y was a leaf in $Y(T_i)$, these 
nodes are in proper configuration with respect to y in $T_{i+1}$. See Figure 3. 

Fig. 2. A search tree for a seven-taxon phylogeny. Directed search tree edges are shown in solid lines; the underlying 
phylogeny is in dotted lines. The search tree node y corresponds to the region r(y) of the phylogeny indicated by the 
cloud. 

Fig. 3. Inserting into a search tree. To insert $\pi_8$ into the phylogeny, we follow the path through the search tree indicated 
with double arrows. We find the correct edge to break to add $\pi_8$ to the tree, and modify the search tree locally to 
accommodate the change. 

Assuming the quartet queries all are consistent with the true topology T, we discover in this way the 
proper place in the tree to insert each new taxon and maintain the invariants required for a complete search 
tree. In particular, the only subtrees whose border nodes need to be considered are those created by the new 
node addition, and as they are all either single edges or derived from a single edge in $Y(T_i)$, they continue to 
have at most two border nodes. 
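
The descent step of the error-free insertion can be sketched as follows. This is a simplified illustration, not the authors' implementation: search tree nodes are plain dicts with hypothetical keys "lists" and "children", the quartet oracle returns a topology as a frozenset of two frozensets, and the update of T and Y(T) after the leaf is found is omitted.

```python
def locate_leaf(root, x, quartet_query):
    """Walk from the root of the search tree to the leaf whose edge receives x.

    Each internal node stores three taxa lists; the first taxon of each list
    serves as the representative of the corresponding child subtree.
    Returns the leaf reached and the number of node queries asked.
    """
    y, queries = root, 0
    while y["children"]:
        reps = [taxa[0] for taxa in y["lists"]]
        topology = quartet_query(x, *reps)
        queries += 1
        # x belongs in the child subtree whose representative is paired with x.
        i = next(i for i, a in enumerate(reps) if frozenset({x, a}) in topology)
        y = y["children"][i]
    return y, queries
```

With error-free answers the number of queries equals the depth of the leaf reached, which is the quantity bounded in the next subsection.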

Theorem 1. If all quartet queries made by this algorithm are consistent with T, then this algorithm returns 
T. Its runtime is O(n log n) with probability 1 - o(1). 

Proof. We have seen that the algorithm returns T. In the next subsection, we show that inserting taxon $\pi_i$ 
requires O(log n) queries with high probability, each of which requires O(1) time; the work to create a new 
edge requires constant time. The overall runtime is O(n log n) with high probability. 

4.1 The height of the search tree 

To prove Theorem 1, we need to know the height of the search tree Y(T). We will show that this tree is 
almost surely balanced, using several lemmas. 

Lemma 1. For any phylogeny T, with n taxa, there exist two disjoint child subtrees A and B of the form 
$t_i(T, v)$ with at least n/6 and at most n/3 taxa. 

Proof. We first show there exists a node u where all $t_i(T, u)$ have at most n/2 taxa. Pick an internal node u 
in T; if all $t_i(T, u)$ have at most n/2 taxa, we are done. Otherwise, move to its neighbour in the $t_i(T, u)$ 
with the most taxa. This process terminates at a node u satisfying the property. Let $n_1 \ge n_2 \ge n_3$ be the 
numbers of taxa in the trees $t_i(T, u)$ at some step. If $n_1 > n/2$, we move to the neighbour u* of u in $t_1$; trees 
$t_i(T, u^*)$ have $n_{11}$, $n_{12}$ and $n_2 + n_3$ taxa respectively, where $n_{11}$, $n_{12}$ are the numbers of taxa in the subtrees 
$t_{11}$, $t_{12}$ of $t_1$ created by removing u*. Since $n_2 + n_3 < n/2$, the component with size over n/2 must be either $t_{11}$ 
or $t_{12}$, which are smaller than $t_1$ since they are its subtrees. 

Now, consider the node u we have found by this process, and let $t_1$ and $t_2$ be the two largest $t_i(T, u)$ 
subtrees, both of which have between n/4 and n/2 taxa. If $t_1$ has more than n/3 taxa, consider the three 
child subtrees in $t_1$ of the neighbour of u in $t_1$: one has zero taxa, so the larger of the other two must have at 
least n/6 taxa. If this tree has at most n/3 taxa, we have found our subtree A; if not, we move one step more 
away from u until we find a subtree small enough. We analogously find B as a subtree of $t_2$. 

Lemma 2. The number of node queries asked by the phylogeny algorithm to assign taxon $\pi_{i+1}$ to its place in 
the tree is at most $37 \log_{6/5} i \approx 203 \ln i$, with probability $1 - o(1/i^4)$, and at most $37 \log_{6/5} n$ with probability 
$1 - o(1/n^4)$. 

Proof. Consider the process of adding $\pi_{i+1}$ to the tree. We consider a sequence $y_1, \ldots, y_k$ of nodes in the search 
tree Y, each corresponding to a subtree $r(y_j)$ of the existing phylogeny. We divide the $y_j$ into phases: phase 
t corresponds to the period in which $r(y_j)$ contains between $(5/6)^t i$ and $(5/6)^{t-1} i$ leaves; after $\log_{6/5} i$ phases, the 
algorithm has found where to put $\pi_{i+1}$. We show that the distribution of the length of each phase is bounded 
above by the sum of three geometrically-distributed random variables. 

Each phase corresponds to taking a subtree and shrinking it by a factor of 5/6. This happens either if the 
largest of the three subtrees of the phylogeny descendant from the current search tree node $y_j$ has at most 
5/6 of the number of taxa we had at the beginning of the current phase, or if $\pi_{i+1}$ belongs in a tree with fewer 
than that many taxa. We concern ourselves only with the first of these ways of ending a phase, so we upper 
bound the length of a phase. 

The queries asked include taxa found in $r(y_j)$, in the order that they occur in permutation $\pi$. In particular, 
we will ask a node query including a node of A with probability at least 1/6 at each step, independently, until we 
finally do ask a query of a node from A. (Since our queries always include at least 5/6 of the taxa, and we have 
not queried any members of A, we always have all members of A available.) After querying a member of A, 
for the phase to continue, we must choose the subtree containing all of B. Now, we ask queries corresponding 
to the current subtree, until we see a taxon from B, which will happen with probability 1/6 or greater at 
each step. Now, we arrive in a state where the current subtree of the phylogeny includes border nodes inside 
A and B, since we must have cut off parts of A and of B, but cannot have cut off all of either without ending 
the phase. Now, we ask queries until we see a node from neither A nor B; this happens with probability at 
least 1/3 at each step. Then, the current search tree node yj must correspond to a node on the edge from A 
to B in the phylogeny, since otherwise one of its subtrees would have three border nodes. 

Thus, the length of a phase is at most the sum of three geometric random variables, with expectations 
6, 6 and 3; we then move to a new tree with at most 5/6n taxa. However, it may have two border nodes as 
well; we label these with a taxon from their neighbouring subtrees (thereby adding two taxa to the current 
subtree) and perform a single quartet query (removing at least two taxa). This gives a new subtree in which 
we can perform the next phase. 

Thus, if G(i) are independent geometric random variables with mean i, then the length of one phase is 
bounded above by G(6) + G(6) + G(3) + 1, and the expected total number of queries is at most $19(\log_{6/5} i + 1)$, 
where for simplicity, we let the G(i) all have mean 6. 

Moreover, this variable is rarely above $37 \log_{6/5} i$. In particular, let Q(n, r) be the negative binomial 
random variable that is the sum of n geometrically distributed variables with mean r. Then 
$\Pr[Q(n, r) > knr] = \Pr[B(knr, 1/r) < n]$, where B(n, p) is a binomial random variable that results from the sum of n 
independent Bernoulli trials, each with mean p. By standard Chernoff methods ([6], p. 6), this probability is 
bounded above by $\exp\left(-\frac{kn(1 - 1/k)^2}{2}\right)$. So, $\Pr[Q(3 \log_{6/5} i, 6) > 36 \log_{6/5} i] < i^{-4}$, meaning that the probability 
we use more than $37 \log_{6/5} i$ queries for taxon $\pi_{i+1}$ is $o(1/i^4)$; similarly, the probability that we use more than 
$37 \log_{6/5} n$ queries for taxon $\pi_{i+1}$ is $o(1/n^4)$. 
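
The constants in this proof are easy to check numerically. The Python sketch below assumes the Chernoff lower-tail bound in its usual form, exp(-mu delta^2 / 2) with mu = kn and delta = 1 - 1/k; the function names are chosen here for illustration.

```python
import math


def phase_count(i: int) -> float:
    """Number of phases, log base 6/5 of i."""
    return math.log(i) / math.log(6 / 5)


def negative_binomial_tail_bound(n_vars: float, k: float) -> float:
    """Bound exp(-k n (1 - 1/k)^2 / 2) on Pr[Q(n, r) > k n r]; it does not depend on r."""
    return math.exp(-k * n_vars * (1 - 1 / k) ** 2 / 2)


def query_budget(i: int) -> float:
    """The bound 37 log_{6/5} i of Lemma 2."""
    return 37 * phase_count(i)
```

With k = 2 and n = 3 log_{6/5} i the exponent is -0.75 log_{6/5} i, which is about -4.11 ln i, so the bound falls below i^{-4}; likewise 37 / ln(6/5) is about 202.9, which is where the constant 203 comes from.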

We emphasize that Y(T) is almost surely balanced regardless of the topology of T. Even if the diameter 
of T is $\Theta(n)$, its corresponding search tree almost surely has height O(log n). We conjecture that the actual 
values of the constants are much smaller than mentioned in the above lemma. 

5 Accounting for errors 

Our search tree algorithm adapts to the case of error-prone quartets where each quartet query independently 
errs with probability p > 0. We assume that $(1-p)^3 > 0.5 + \epsilon$ for some $\epsilon > 0$; we relax this assumption at 
the end of the section. 



5.1 Random walk in the search tree 

Let Y(T') be a complete search tree for T' and let x be a taxon not in T'. We will perform a random walk 
on Y(T') to place x into its proper place in T', where each step of the random walk is determined by at most 
3 quartet queries. 

Let $y_i$ be the location of the random walk after i steps, with $y_0$ the root of Y(T'). If $y_i$ is not a leaf 
node, query the border nodes of $r(y_i)$. If any border node query gives the answer $x \notin r(y_i)$, go to the parent 
node of $y_i$. If all border nodes give answers consistent with $x \in r(y_i)$, query the node $y_i$ and descend to the 
child of $y_i$ indicated. 

If $y_i$ is a leaf, corresponding to an edge of T', let it have counter variable c initially set to 0. Query its 
border nodes as before; if each is consistent with $x \in r(y)$, increment c. Otherwise, decrement it if it is greater 
than 0; if c = 0, move to the parent node of y. After a number of queries we will soon compute, we are at a 
node in Y(T'): if it is a leaf, add x to that node of the search tree as for the insertion algorithm with error-free 
data. If not, signal failure. 

The algorithm finds the proper place in the tree with high probability. Let $y_x$ be the leaf in the search 
tree where we should insert taxon x. After i steps in the random walk, let the random variable $d_i$ be the 
distance in the search tree between $y_i$ and $y_x$. Let the random variable $g_i$ have value $-c$ if $y_x = y_i$, $d_i + c$ if 
$y_x \ne y_i$ and $y_i$ is a leaf of Y(T'), and $d_i$ if $y_i$ is not a leaf. If $g_i \le 0$, then the current node of the random 
walk is the correct place to put x. The following simple observation is essential to proving the correctness of 
our algorithm. 

Lemma 3. Consider the random variables $g_i$ defined above. 
1. $E[g_i] \le d_0 + (1 - 2(1-p)^3)\, i$. 

2. If $i > \frac{d_0}{2(1-p)^3 - 1}$, then $\Pr[y_i \ne y_x] \le \exp\left(-\frac{(d_0 + (1 - 2(1-p)^3)\, i)^2}{2i}\right)$. 

Proof. At each step of the random walk, there are at most two border nodes, so at most three queries. If 
each gives a correct answer, $g_i$ decreases by 1; if any incorrect queries occur, $g_i$ increases by at most one, 
though it might still decrease by 1. In the worst case, the probability that $g_i$ decreases is at least $(1-p)^3$, so 
$E[g_{i+1} - g_i] \le -(1-p)^3 + (1 - (1-p)^3) = 1 - 2(1-p)^3$. The result follows from linearity of expectation, 
since $g_0 = d_0$. The second claim follows from the Chernoff bound, as the queries are independent. 
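
The drift computation in this proof can be made concrete with a short Python sketch; the function names are illustrative. It evaluates the per-step drift 1 - 2(1-p)^3 and the bound of part 1 of the lemma, and reports the first step at which that bound becomes negative.

```python
import math


def drift(p: float) -> float:
    """Upper bound 1 - 2(1-p)^3 on the expected change of g in one step."""
    return 1 - 2 * (1 - p) ** 3


def expected_potential_bound(d0: float, i: int, p: float) -> float:
    """The bound d0 + (1 - 2(1-p)^3) i on E[g_i]."""
    return d0 + drift(p) * i


def steps_until_bound_negative(d0: float, p: float) -> int:
    """Smallest i for which the bound on E[g_i] is strictly negative."""
    if drift(p) >= 0:
        raise ValueError("requires (1-p)^3 > 1/2")
    return math.floor(d0 / -drift(p)) + 1
```

For p = 0.1 the drift is about -0.458 per step, so a walk that starts at distance 10 from the target has a negative expected potential after 22 steps. The drift stays negative only while p is below roughly 0.206, which is the assumption (1-p)^3 > 1/2.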

Now, we have a straightforward taxon insertion algorithm. For each taxon $\pi_{i+1}$, we run the random walk 
long enough to handle the case that $g_0 = 203 \ln i$. To make the error probability at most $1/i^2$, we require 
that the random walk have j steps, where $\exp\left(-\frac{(203 \ln i + (1 - 2(1-p)^3)\, j)^2}{2j}\right) \le \frac{1}{i^2}$. The minimum value of j to make 
this guarantee is $j \ge k \ln n$, for $k = \frac{203(2(1-p)^3 - 1) + 2 + 2\sqrt{203(2(1-p)^3 - 1) + 1}}{(2(1-p)^3 - 1)^2}$. 
We can now state the taxon insertion procedure in detail. 



Algorithm 1 InsertTaxon(x, T, Y(T)) 

Initialize the random walk at the root of Y(T). 
for i = 1 to k log n do 
    Simulate the next step of the random walk. 
end for 
Let $y_{k \log n}$ be the current node of the random walk. 
if $y_{k \log n}$ is a leaf then 
    Attach x to $r(y_{k \log n})$ in T and update Y(T). 
else 
    return Failure. 
end if 



Assuming that the tree $T_{i-1}$ is correct, this algorithm then adds a new taxon in O(log i) queries, with 
error or failure probability $O(1/i^2)$. 

5.2 Finding quartets to ask 

We must ensure that we can always find a quartet that has not been queried before in O(1) time. This requires 
two separate conditions to hold: first, that enough such quartets exist, and second, that we can find them in 
O(1) time. 



The first of these is easy, as long as we start with a constant-sized guide tree T' on a set S' of at least m 
taxa, where m is the smallest number such that k log m < m - 2, with k equal to the multiple of log i found 
using the formula in the previous section. In each insertion phase, we use at most k log i quartets at any node 
of the search tree; the extreme case is where the three child subtrees of the current tree T have 1, 1, and i - 2 
taxa in them. 

The latter is more complicated. Assume that for each node y in Y, $\ell_j(y)$ is the list of all taxa in the child 
subtree $t_j(r(y), s(y))$ (for j = 1, 2, 3). To find the next quartet in O(1) time, we must fetch the next taxon 
in $t_j(T, s(y))$ in O(1) time. We first enumerate taxa in $\ell_j(y)$. Once all taxa in $\ell_j(y)$ have been used, we pick 
the border node $b_j(y)$ of y in $t_j(T, s(y))$ (if it exists). The node $b_j(y)$ is associated with some ancestor $y_1$ of y 
and we have $r(y) \subseteq t_i(r(y_1), b_j(y))$ for some i. Taxa in $\ell_{(i+1) \bmod 3}(y_1) \cup \ell_{(i+2) \bmod 3}(y_1)$ are also in $t_j(T, s(y))$, 
so we enumerate them. Once they have been used, we find border nodes of $r(y_1)$ such that two of their taxa 
lists contain taxa in $t_j(T, s(y))$ that have not been used so far. Once all taxa from a node $y_i$ have been used, 
we look at border nodes of $r(y_i)$. This process can be thought of as breadth-first search on a directed graph 
where an arc denotes the relationship of being a border node. We leave details to the longer version of this 
paper. 

Now, we give the complete algorithm. First, pick a constant-sized set $S' \subset S$ of m taxa and find the 
phylogeny for S' consistent with the most quartets. Then iteratively add taxa to the tree using the procedure 
InsertTaxon described above. 



Algorithm 2 Reconstruct(S, m) 

Pick a subset $S' \subset S$ with m taxa. 
Find phylogeny T' on S' consistent with the most quartets by exhaustive search. 
Build a search tree Y(T') for T'. 
for all $s \in S \setminus S'$ do 
    InsertTaxon(s, T', Y(T')) 
end for 



The running time of this algorithm is O(n log n) with high probability. The error probability can be 
bounded by $\mu(m) + \sum_{i=m}^{n} \frac{1}{i^2}$, where $\mu(m)$ is the probability that the maximum quartet compatibility tree 
on a random set of m taxa is not consistent with T. This quantity is constant for constant m; in the next 
section we show how to make the total error probability o(1) as n grows. 

The remaining case where $(1-p)^3 \le \frac{1}{2}$ can be solved by redefining node queries. Each node query is 
now implemented by asking $c_p$ queries and returning the majority direction, with constants $c_p$ and C chosen 
appropriately. We defer details to the longer version of this paper. 
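
The effect of majority voting can be estimated with a small calculation. The sketch below is an assumption-laden illustration rather than the construction deferred to the longer paper: it requires p < 1/2, treats the repeated queries as independent, and lower-bounds the success of a vote by the probability that strictly more than half of the answers are correct. The function names are hypothetical.

```python
from math import comb


def majority_correct(c: int, p: float) -> float:
    """Probability that more than c/2 of c independent queries are correct,
    when each query errs with probability p."""
    return sum(comb(c, j) * (1 - p) ** j * p ** (c - j) for j in range(c // 2 + 1, c + 1))


def smallest_repetition(p: float, eps: float = 0.0, max_c: int = 10001) -> int:
    """Smallest odd c whose amplified success rate m satisfies m^3 > 0.5 + eps."""
    for c in range(1, max_c, 2):
        if majority_correct(c, p) ** 3 > 0.5 + eps:
            return c
    raise ValueError("no suitable c found; is p < 1/2?")
```

For example, with p = 0.3 a single query gives (1-p)^3 = 0.343, three repetitions give 0.784^3, which is still below 1/2, and five repetitions suffice.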

6 Shrinking the error probability to o(1) 

The algorithm presented in the previous section errs with constant probability, since it starts with a constant- 
sized tree that may have errors, and since the additions to this tree also have constant probability of error. 

If we start with a non-constant-sized guide tree, we can reduce the error probability. The main lemma is 
in the next subsection. 

Theorem 2. The algorithm Reconstruct(S, max($\lceil \log\log n \rceil$, m)) both returns the correct tree and runs in 
O(n log n) time with probability 1 - o(1). 

Proof. The exhaustive search step requires enumerating all $\Theta((\log\log n)^4)$ quartets, on all $O((\log\log n)! \log n)$ 
topologies on log log n taxa; the product of these is $O((\log\log n)^{4 + \log\log n} \log n)$, which is sublinear in n. We 
have already shown that the rest of the algorithm requires O(n log n) time with high probability. 

We will show below that $\mu(\log\log n)$, the failure probability of the guide tree algorithm, is o(1). The failure 
probability of the insertion procedure is at most $\sum_{i=\log\log n}^{n} \frac{1}{i^2}$, which is $O\left(\frac{1}{\log\log n}\right)$, and so o(1). As such, the 
overall failure probability is o(1), as desired. 
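
For completeness, the tail of the sum of inverse squares can be bounded by a standard telescoping argument (valid for any starting index m of at least 2):

```latex
\sum_{i=m}^{n} \frac{1}{i^2}
  \le \sum_{i=m}^{n} \frac{1}{i(i-1)}
  = \sum_{i=m}^{n} \left( \frac{1}{i-1} - \frac{1}{i} \right)
  = \frac{1}{m-1} - \frac{1}{n}
  < \frac{1}{m-1}, \qquad m \ge 2 .
```

Taking m = log log n gives a bound of order 1 / log log n, which tends to 0 as n grows.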

We note that the guide tree could have more or fewer than log log n taxa; we merely require that the brute 
force guide tree construction requires O(n log n) time and has o(1) error probability. 



6.1 Maximum quartet consistency is consistent 

Here, we show that the maximum quartet consistency approach is consistent for our error model. This result 
(which may be of independent interest, as our error model has been studied before [10]) shows that $\mu(n) \to 0$ 
as n grows. 

Theorem 3. Let $T_{mqc}$ be the phylogeny compatible with the most quartet queries for a set of n taxa and let 
T* be the true phylogeny. If each quartet query errs independently with probability p, then 
$\mu(n) = \Pr[T_{mqc} \ne T^*] = o(1)$ as $n \to \infty$. 

To prove this theorem, we first show a few properties of quartets. 

Definition 2. The quartet distance $d_Q(T, T')$ of phylogenies T and T' on the same set of taxa is the number 
of quartets on which T and T' differ. 

This distance was studied in [2,3] among others. 

Lemma 4. The quartet distance between distinct phylogenies is at least n - 3. 

Proof. Let T and T' be distinct phylogenies. Let $(S_1, S_2)$ be a split in T not present in T'. Let $(S'_1, S'_2)$ be a 
split in T' not present in T where none of the sets $A = S_1 \cap S'_1$, $B = S_1 \cap S'_2$, $C = S_2 \cap S'_1$, $D = S_2 \cap S'_2$ is 
empty; such a split exists since T and T' are distinct. Choose taxa a, b, c, d from sets A, B, C, D, respectively. 
The quartet induced by T is ab|cd, whereas in T' it is ac|bd. This gives $\phi = |A||B||C||D|$ conflicting quartets; 
$\phi$ is at least n - 3 since $|A| + |B| + |C| + |D| = n$, and the product is minimized when $|A| = n - 3$ and 
$|B| = |C| = |D| = 1$. 
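
On small examples the quartet distance can be computed by brute force, which gives a quick sanity check of the lemma. In the Python sketch below, a binary phylogeny is represented (as an assumption of this illustration) by the list of its non-trivial splits, each given as the frozenset of taxa on one side; the enumeration is quartic in the number of taxa, unlike the fast algorithms of [2].

```python
from itertools import combinations


def quartet_topology(splits, quad):
    """Topology induced on the four taxa in quad, as a frozenset of two pairs."""
    for side in splits:
        inside = [t for t in quad if t in side]
        if len(inside) == 2:
            outside = [t for t in quad if t not in side]
            return frozenset({frozenset(inside), frozenset(outside)})
    raise ValueError("no split resolves this quartet; is the tree binary?")


def quartet_distance(splits1, splits2, taxa):
    """Number of four-taxon subsets on which the two trees disagree."""
    return sum(
        quartet_topology(splits1, quad) != quartet_topology(splits2, quad)
        for quad in combinations(sorted(taxa), 4)
    )
```

For the five-taxon trees with splits {ab, de} and {ac, de}, the function returns 2, which equals n - 3 and so meets the lemma's bound with equality.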

The number of trees with small quartet distance from a fixed tree T is small. 

Definition 3. A taxon reinsertion (TR) operation consists of deleting a taxon from a phylogeny and attaching 
it to a remaining edge, creating three new edges. 

Lemma 5. Let T and T' be phylogenies such that $d_Q(T, T') < n \log^2 n$. The number of TR operations required 
to transform T into T' is at most $c \log^4 n$ for some constant c. 

Proof. Let $(S_1, S_2)$ be a split of T not present in T'. Let $(S'_1, S'_2)$ be some split in T' that is not present in T 
that minimizes $\phi = |A||B||C||D|$ as defined earlier. Without loss of generality, assume that A is the largest 
of the sets. Observe that each of the sets B, C, D must have at most $\log^2 n$ taxa: otherwise $\phi > n \log^2 n$, so 
$d_Q(T, T') > n \log^2 n$. We delete all taxa in B and C from both T and T' to create trees $T^{(1)}$ and $T'^{(1)}$. By 
Lemma 4, this erases at least n - 3 conflicting quartets. We pick splits $(S_1, S_2)$ and $(S'_1, S'_2)$ in $T^{(1)}$ and $T'^{(1)}$ 
as we previously did for the original trees and repeat the process to obtain trees $T^{(2)}$ and $T'^{(2)}$, this time 
removing at least $n - 2 \log^2 n - 3$ discordant quartets. 

We iterate the process until $T^{(i)} = T'^{(i)}$ for some i, which is $O(\log^2 n)$ since the total number of conflicting 
quartets is at most $n \log^2 n$, and each iteration erases $\Omega(n)$. The sets B and C have at most $\log^2 n$ taxa at 
each step of the algorithm. Therefore, at most $O(\log^4 n)$ taxa are deleted from both trees. 

Let R be the taxa removed. The restrictions of both T and T' to S - R are the same. To transform T 
to T', we move all nodes in R to a new side of the tree T, and then move each to the proper place in T' in 
$O(\log^4 n)$ TR operations. 

Corollary 1. For any phylogeny T, the number of phylogenies T' such that $d_Q(T, T') < n \log^2 n$ is at most 
$n^{b \log^4 n}$ for a large enough constant b. 

Proof. Each T' with distance from T at most $n \log^2 n$ can be obtained from T by $c \log^4 n$ TR operations. 
For any tree, the number of ways to perform a TR operation is less than $2n^2$ since we can choose any of the 
n taxa and reinsert it at any of the 2n - 5 edges other than the one at which it was before the operation. 
This gives fewer than $(2n^2)^{c \log^4 n}$ phylogenies that can be created by repeating the operation $c \log^4 n$ times. 
Taking b = 4c finishes the proof. 

Now we can prove the maximum quartet compatibility consistency theorem. 

Proof. Suppose some tree T' is consistent with more quartets than T*, and $d_Q(T', T^*) = q$. At least half of the 
q quartets where T* and T' differ must be erroneous; since they are independent errors, this has probability 
at most $\exp\left(-q \frac{(1 - 2p)^2}{2}\right)$ by the Chernoff bound. 



Let $\mathcal{T}_0$ be the set of all incorrect phylogenies with quartet distance from T* less than $n \log^2 n$. Then 
$|\mathcal{T}_0| < n^{b \log^4 n}$, and for trees in $\mathcal{T}_0$, Lemma 4 gives that $q \ge n - 3$. The probability that any tree in $\mathcal{T}_0$ is 
consistent with more queries than T* is bounded by $n^{b \log^4 n} \exp\left(-(n - 3) \frac{(1 - 2p)^2}{2}\right)$, which is o(1) as n grows. 

Now, consider the incorrect phylogenies $\mathcal{T}_1$ that are not in $\mathcal{T}_0$. There are fewer than $2^n n! < 2^{n(1 + \log n)}$ 
such topologies, and for each, $d_Q(T, T^*) \ge n \log^2 n$. The probability that any tree in $\mathcal{T}_1$ is consistent with 
more quartets than T* is bounded above by $2^{n(1 + \log n)} \exp\left(-n \log^2 n \frac{(1 - 2p)^2}{2}\right)$, which is o(1) as n grows. 

So the probability that any incorrect tree is consistent with more quartets than T* converges to 0 as n 
grows. 

7 Experiments 

We have developed a prototype implementation of our algorithm to investigate its running time and properties. 
We have tested the algorithm in three scenarios. First, we tested the performance of the algorithm for the 
case with no errors. Second, we tested the performance of the random walk algorithm when the data was 
generated according to the model with independent errors. Finally, we ran the random walk algorithm on 
real biological datasets. 

The tree topologies used in the synthetic data sets were chosen at random from the uniform distribution. 
In the iid error case, every quartet query gave one of the two possible wrong answers with probability p. In 
our experiments, we set p = 0.1. 
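
The i.i.d. error model is simple to simulate. The Python sketch below is an illustration, not the prototype used in the experiments: it returns the true topology with probability 1 - p and otherwise one of the two wrong topologies. Choosing between the two wrong answers uniformly is an assumption made here, and the function names are hypothetical.

```python
import random


def topologies(a, b, c, d):
    """The three possible quartet topologies on {a, b, c, d}."""
    def pair(w, x, y, z):
        return frozenset({frozenset({w, x}), frozenset({y, z})})
    return [pair(a, b, c, d), pair(a, c, b, d), pair(a, d, b, c)]


def noisy_query(true_topology, taxa, p, rng):
    """Answer a quartet query on taxa, erring independently with probability p."""
    if rng.random() >= p:
        return true_topology
    wrong = [t for t in topologies(*taxa) if t != true_topology]
    return rng.choice(wrong)  # assumption: the two wrong answers are equally likely
```

Passing an explicit random.Random instance as rng keeps simulated runs reproducible.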

The algorithm for error-free data is very fast even for reasonably large phylogenies. For data sets having 
10000 taxa or less, constructing the tree takes less than a second. For 20000 taxa, it takes roughly 2 seconds. 

The random walk algorithm is roughly 5 times slower than the algorithm for error-free data. Constructing 
a tree having 10000 taxa takes about 5 seconds, whereas a tree with 20000 taxa requires 9 seconds. 

Table 1. Running times of the algorithm on the error-free and i.i.d. error data sets, by number of taxa 

Algorithm      1000     5000     10000    20000 
Error-free     < 1s     < 1s     < 1s     2s 
Random walk    < 1s     2s       5s       9s 


We ran the algorithm on several protein families from the Pfam database [1]. Quartet queries were answered 
with the Four-Point method [15] based on estimated evolutionary distances between sequences. Distances were 
estimated based on pairwise BLOSUM62 scores using a method by Sonnhammer and Hollich [17]. We used 
neighbor-joining trees on a subset of 150 sequences (chosen at random from the whole set of sequences) as our 
initial guide trees. Our prototype implementation was able to process a dataset of around 12000 sequences in 
about 16 minutes (see Table 2). 
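
A quartet query answered from pairwise distances can be sketched with the four-point rule: among the three ways to pair up the taxa, choose the one with the smallest sum of within-pair distances. The Python code below is a minimal illustration with a hypothetical function name; the distance estimation from BLOSUM62 scores is not reproduced here.

```python
def four_point(dist, a, b, c, d):
    """Return the quartet topology on {a, b, c, d} chosen by the four-point rule.

    dist(u, v) is any symmetric distance function on taxa. The result is a
    frozenset of two frozensets, e.g. ab|cd.
    """
    sums = {
        (a, b, c, d): dist(a, b) + dist(c, d),
        (a, c, b, d): dist(a, c) + dist(b, d),
        (a, d, b, c): dist(a, d) + dist(b, c),
    }
    w, x, y, z = min(sums, key=sums.get)
    return frozenset({frozenset({w, x}), frozenset({y, z})})
```

On exact tree distances the minimal sum always identifies the true topology; with estimated distances the answer can be wrong, which is the situation the error model is meant to capture.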



Table 2. Running times of the algorithm for several Pfam families 

Protein family           Sequences   Average length   Running time 
Maf (PF02545)            1980        189.60           38s 
2Oxoacid_dh (PF00198)    3701        225.10           1m49s 
PALP (PF00291)           11815       294.40           15m42s 



In all our experiments, the height of search trees constructed by the algorithm was less than 40. This 
supports our view that the constants in Lemma 2 can be improved. 

8 Conclusion 

We have presented a fast algorithm that is guaranteed to reconstruct the correct phylogeny with high 
probability under an error model where each quartet query errs with a fixed probability, independently of others. 
The algorithm runs in O(n log n) time, which is the lower bound for any phylogeny reconstruction algorithm. 
Our prototype implementation seems reasonably fast on both real and simulated datasets. 

This work could be extended in many directions. From a theoretical perspective, it would be interesting to 
know whether there exist fast algorithms that offer similar performance guarantees under commonly studied 
models of sequence evolution, such as Jukes-Cantor or Cavender-Farris. 

From a practical perspective, it would be interesting to compare the results of our algorithm to others. 
We plan to extend our algorithm to make use of additional information such as the length of the middle 
edge in reconstructed quartets. This would enable the algorithm to distinguish between more credible and 
less credible queries, which may lead to an overall performance improvement. Another way to improve the 
algorithm is by improving the procedure of finding new quartets to ask so as to minimize the correlation 
between errors. 



References 

1. Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths-Jones, S., Howe, K.L., Marshall, 
M., Sonnhammer, E.L.L.: The Pfam protein families database. Nucleic Acids Research 30(1), 276-280 (2002) 

2. Brodal, G., Fagerberg, R., Pedersen, C.: Computing the quartet distance between evolutionary trees in time O(n log n). 
Algorithmica 38, 377-395 (2003) 

3. Bryant, D., Tsang, J., Kearney, P.E., Li, M.: Computing the quartet distance between evolutionary trees. In: 
Proceedings of SODA 2000. pp. 285-286 

4. Csűrös, M.: Fast recovery of evolutionary trees with thousands of nodes. J. Comp. Biol. 9(2), 277-297 (2002) 

5. Daskalakis, C., Mossel, E., Roch, S.: Phylogenies without branch bounds: Contracting the short, pruning the deep. 
In: Proceedings of RECOMB 2009. pp. 451-465 

6. Dubhashi, D.P., Panconesi, A.: Concentration of measure for the analysis of randomized algorithms. Cambridge 
Univ. Press (2009) 

7. Erdős, P.L., Steel, M.A., Székely, L.A., Warnow, T.: A few logs suffice to build (almost) all trees: Part II. Theor. 
Comput. Sci. 221(1-2), 77-118 (1999) 

8. Feige, U., Peleg, D., Raghavan, P., Upfal, E.: Computing with unreliable information. In: Proceedings of STOC 
1990. pp. 128-137 

9. Felsenstein, J.: Inferring Phylogenies. Sinauer (2001) 

10. Gronau, I., Moran, S., Snir, S.: Fast and reliable reconstruction of phylogenetic trees with very short edges. In: 
Proceedings of SODA 2008. pp. 379-388 

11. Kannan, S.K., Lawler, E.L., Warnow, T.J.: Determining the evolutionary tree using experiments. J. Algorithms 
21(1), 26-50 (1996) 

12. Karp, R.M., Kleinberg, R.: Noisy binary search and its applications. In: Proceedings of SODA 2007. pp. 881-890 

13. King, V., Zhang, L., Zhou, Y.: On the complexity of distance-based evolutionary tree reconstruction. In: 
Proceedings of SODA 2003. pp. 444-453 

14. Price, M.N., Dehal, P.S., Arkin, A.P.: FastTree: Computing large minimum evolution trees with profiles instead 
of a distance matrix. Mol. Biol. Evol. 26(7), 1641-1650 (2009) 

15. Ranwez, V., Gascuel, O.: Quartet-based phylogenetic inference: Improvements and limits. Mol. Biol. Evol. 18(6), 
1103-1116 (2001) 

16. Snir, S., Warnow, T., Rao, S.: Short quartet puzzling: A new quartet-based phylogeny reconstruction algorithm. 
J. Comp. Biol. 15(1), 91-103 (2008), http://dx.doi.org/10.1089/cmb.2007.0103 

17. Sonnhammer, E.L.L., Hollich, V.: Scoredist: A simple and robust protein sequence distance estimator. BMC 
Bioinformatics 6, 108 (2005) 

18. Strimmer, K., von Haeseler, A.: Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree 
topologies. Mol. Biol. Evol. 13(7), 964-969 (1996) 

19. Wu, G., Kao, M.Y., Lin, G., You, J.H.: Reconstructing phylogenies from noisy quartets in polynomial time with 
a high success probability. Alg. Mol. Biol. 3 (2008) 

20. Wu, G., You, J.H., Lin, G.: Quartet-based phylogeny reconstruction with answer set programming. IEEE/ACM 
Trans. Comput. Biol. Bioinf. 4(1), 139-152 (2007) 



